Reducing Redundancy with Anchor Text and Spam Priors
Abstract
In this paper, we document our participation in the TREC 2011 Web Track. We had multiple aims. This year, tougher topics were selected for the Web Track, for which less popularity information is available. We look at the relative value of anchor text for these less popular topics and at the impact of spam priors. Full-text retrieval on the ClueWeb09 B collection suffers from text spam, especially in the top 5 ranks. The spam prior largely reduces the impact of spam, leading to a boost in precision. We find that, in contrast to the more common queries of last year, anchor text does not improve the ad hoc retrieval performance of a full-text baseline for less common queries. However, for diversity, mixing anchor text and full text leads to an improvement. Closer analysis reveals that when anchor text and full text are mixed, fewer relevant nuggets are retrieved, but they cover more subtopics. Anchor text is thus an effective way of reducing redundancy and increasing coverage of subtopics at the same time.
Related resources
Using Anchor Text, Spam Filtering and Wikipedia for Web Search and Entity Ranking
In this paper, we document our efforts in participating in the TREC 2010 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track we wanted to compare the effectiveness of anchor text of the category A and B collections and the impact of global document quality measures such as PageRank and spam scores. For the Entity Ranking Track, we use Wikipedia as a pivot to find relevant ent...
Venue Recommendation and Web Search Based on Anchor Text
This paper presents the University of Amsterdam’s participation in TREC 2014. For the Contextual Suggestion Track, we experimented with the use of anchor text representations in the language modeling framework, and base our runs either on full ClueWeb12 or the subset of touristic aggregators (e.g., tripadvisor) provided by the organizers of the track. We also look at the effectiveness of priors...
University of Amsterdam at TREC 2014
This paper presents the University of Amsterdam’s participation in TREC 2014. For the Contextual Suggestion Track, we experimented with the use of anchor text representations in the language modeling framework, and base our runs either on full ClueWeb12 or the subset of touristic aggregators (e.g., tripadvisor) provided by the organizers of the track. We also look at the effectiveness of priors...
An Anti-spam Filter Combination Framework for Text-and-Image Emails through Incremental Learning
We present an anti-spam filtering framework that combines text-based and image-based anti-spam filters. First, an incremental learning approach to reducing mismatches between training and test datasets is proposed to resolve the problem of a lack of training data for legitimate emails that contain both text and images. Then, the outputs of text-based and image-based filters are combined with th...
A Critical Analysis of Financial Fraud Spam in English in Terms of Persuasive Strategies: Personalization, Presupposition, and Lexical Choices
The term ‘spam’ addresses unsolicited emails sent in bulk; therefore, the term ‘financial fraud spam’ refers to unwanted bulk emails in which different tricks and techniques are employed to swindle money from the recipients. Estimates show that more than 80% of worldwide email traffic in 2011 was spam. It should be noted that while the number of daily spam emails in 2002 was 2.4 billion, this numbe...
Publication date: 2011